Principles of Unconstrained Minimization
MATH008 Lesson 9
00:00
We transition from the theoretical existence of a minimum to the algorithmic engine of optimization. Our core objective is to minimize $f(x)$ (9.1) where $f: \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable. Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^*$ to be optimal is $\nabla f(x^*) = 0$.
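As a concrete illustration of the optimality condition (a standard least-squares example, not taken from the lesson): setting $\nabla f(x) = 0$ for $f(x) = \|Ax - b\|_2^2$ yields the normal equations $A^T A x^* = A^T b$. A minimal numpy sketch:

```python
import numpy as np

# Standard least-squares illustration (not from the lesson): f(x) = ||Ax - b||_2^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# grad f(x) = 2 A^T (A x - b) = 0  =>  normal equations  A^T A x* = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

print(np.linalg.norm(2 * A.T @ (A @ x_star - b)))   # ~0: first-order condition holds
```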

The Algorithmic Framework

Numerical solutions construct a minimizing sequence: a sequence of points $x^{(0)}, x^{(1)}, \dots \in \text{dom } f$ with $f(x^{(k)}) \to p^*$ as $k \to \infty$. Each step updates the position via $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$, where $t^{(k)} > 0$ is the step size (chosen, e.g., by exact or backtracking line search) and $\Delta x^{(k)}$ is a descent direction, i.e., $\nabla f(x^{(k)})^T \Delta x^{(k)} < 0$ whenever $x^{(k)}$ is not optimal.
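A minimal sketch of this general descent loop, using the negative gradient as the descent direction and a backtracking line search; the parameters alpha, beta, the tolerance, and the quadratic test problem are illustrative assumptions, not prescribed by the lesson:

```python
import numpy as np

def descent(f, grad, x0, alpha=0.3, beta=0.7, tol=1e-8, max_iter=500):
    """Generic descent method: repeat x <- x + t * dx with a backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:             # stop when the gradient is (nearly) zero
            break
        dx = -g                                  # descent direction: negative gradient
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                            # backtrack until sufficient decrease holds
        x = x + t * dx
    return x

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T Q x - b^T x
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_min = descent(lambda x: 0.5 * x @ Q @ x - b @ x, lambda x: Q @ x - b, np.zeros(2))
print(x_min, np.linalg.solve(Q, b))              # the two should (nearly) agree
```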

Initialization & Convergence

The methods described in this chapter require a suitable starting point $x^{(0)}$. The starting point must lie in $\text{dom } f$, and in addition the sublevel set $S = \{x \in \text{dom } f \mid f(x) \le f(x^{(0)})\}$ must be closed. Since a descent method only produces iterates with $f(x^{(k)}) \le f(x^{(0)})$, this guarantees the whole sequence stays in the closed set $S \subseteq \text{dom } f$, a well-behaved region on which the convergence analysis (via bounds on the Hessian) applies; the condition holds for any starting point when $f$ itself is closed, i.e., when all of its sublevel sets are closed.

Descent Directions

The simplest direction is $\Delta x = -\nabla f(x)$ (gradient descent). However, efficiency often requires accounting for the geometry of the problem: for a norm $\|\cdot\|$, the normalized steepest descent direction is $\Delta x_{\text{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\| \le 1\}$, and the unnormalized direction is $\Delta x_{\text{sd}} = \|\nabla f(x)\|_* \, \Delta x_{\text{nsd}}$, where $\|\cdot\|_*$ is the dual norm. Two important cases follow (a quadratic-norm sketch appears after this list):

  • Quadratic Norm: $\|z\|_P = (z^T P z)^{1/2} = \|P^{1/2}z\|_2$ with $P \succ 0$, giving $\Delta x_{\text{sd}} = -P^{-1}\nabla f(x)$ (gradient descent in the changed coordinates $\bar{x} = P^{1/2}x$).
  • $\ell_1$ Norm: $\Delta x_{\text{sd}} = \Delta x_{\text{nsd}} \|\nabla f(x)\|_\infty = - \frac{\partial f(x)}{\partial x_i} e_i$, where $i$ is an index with $|\nabla f(x)_i| = \|\nabla f(x)\|_\infty$ (the dual of $\ell_1$ is $\ell_\infty$); each step updates a single coordinate, i.e., coordinate descent.
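A minimal sketch of the quadratic-norm case: the steepest descent direction is obtained by solving $P\,\Delta x_{\text{sd}} = -\nabla f(x)$. The matrices and the choice $P \approx \nabla^2 f$ below are illustrative assumptions, not from the lesson:

```python
import numpy as np

def quadratic_norm_sd(grad_x, P):
    """Steepest descent direction for the quadratic norm ||z||_P: dx_sd = -P^{-1} grad f(x)."""
    return np.linalg.solve(P, -grad_x)

# Illustrative badly scaled quadratic f(x) = 0.5 x^T H x
H = np.diag([100.0, 1.0])
grad = lambda x: H @ x

x = np.array([1.0, 1.0])
print(-grad(x))                        # plain gradient direction: dominated by the steep coordinate
print(quadratic_norm_sd(grad(x), H))   # with P = H the direction points straight at the minimizer
```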

Second-Order Models and Trust Regions

Newton's method uses a second-order Taylor approximation: $$\hat{f}(x+v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v$$ This quadratic is minimized at $v = \Delta x_{\text{nt}} = -\nabla^2 f(x)^{-1}\nabla f(x)$ (the Newton step). We define a trust region: the set $\{v \mid \|v\|_2 \le \gamma\}$, where the parameter $\gamma$ reflects our confidence in the second-order model. In practice the Newton step is computed by solving $\nabla^2 f(x)\, v = -g$ with $g = \nabla f(x)$, typically via a Cholesky factorization $\nabla^2 f(x) = LL^T$: forward-substitute $Lw = -g$, then back-substitute to obtain $v = L^{-T}w = -L^{-T}L^{-1}g$ (the same factorization strategy, with a fill-reducing permutation $P$ for sparse problems, is used when solving KKT systems).
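A minimal sketch of computing the Newton step via a Cholesky factorization, using SciPy's cho_factor/cho_solve; the quadratic test problem is an illustrative assumption:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(grad_x, hess_x):
    """Newton step: solve hess(x) dx = -grad(x) via Cholesky (hess must be positive definite)."""
    c, low = cho_factor(hess_x)            # hess = L L^T
    return cho_solve((c, low), -grad_x)    # forward substitution, then back substitution

# Illustrative quadratic f(x) = 0.5 x^T Q x - b^T x: one Newton step reaches the minimizer
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.array([5.0, -5.0])
dx = newton_step(Q @ x - b, Q)             # grad = Qx - b, hess = Q
print(x + dx, np.linalg.solve(Q, b))       # the two agree
```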

🎯 Core Principles of Convergence
Efficiency is measured by the speed at which the error $f(x^{(k)}) - p^*$ vanishes. For strongly convex functions, the error converges to zero at least as fast as a geometric series: $f(x^{(k)}) - p^* \le c^k\,(f(x^{(0)}) - p^*)$ for some $c \in (0,1)$. In the context of iterative numerical methods, this is called linear convergence.
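A small numerical illustration of this geometric decay, using gradient descent with a fixed step size on a strongly convex quadratic; the problem data and step size are illustrative assumptions, not from the lesson:

```python
import numpy as np

# Strongly convex quadratic: f(x) = 0.5 x^T Q x, with minimizer x* = 0 and p* = 0
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x

x = np.array([1.0, 1.0])
t = 1.0 / 10.0                       # fixed step size 1/L, L = largest eigenvalue of Q
errors = []
for k in range(30):
    errors.append(f(x))              # f(x^{(k)}) - p*   (p* = 0 here)
    x = x - t * (Q @ x)              # gradient descent step

ratios = np.array(errors[1:]) / np.array(errors[:-1])
print(ratios.max())                  # stays below some c < 1: geometric (linear) convergence
```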
  • Suboptimality Bound: for strictly convex self-concordant $f$, $p^* \geq f(x) + \lambda(x) + \log(1 - \lambda(x))$, valid if the Newton decrement $\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}$ satisfies $\lambda(x) < 1$.
  • Self-concordance Sum: If $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant.
  • Hessian Sparsity: substantial efficiency is gained when the Hessian is banded, i.e., $\nabla^2 f(x)_{ij} = 0$ for $|i-j| > k$; the Newton system can then be factored in roughly $nk^2$ flops rather than the $(1/3)n^3$ of a dense Cholesky factorization (a banded-solve sketch follows this list).
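A minimal sketch of exploiting a banded Hessian in the Newton system, using SciPy's banded symmetric solver; the tridiagonal Hessian and gradient below are illustrative assumptions, not from the lesson:

```python
import numpy as np
from scipy.linalg import solveh_banded

# Illustrative tridiagonal (bandwidth k = 1) positive definite Hessian of size n
n = 6
diag = np.full(n, 2.0)
off = np.full(n - 1, -1.0)
g = np.ones(n)                           # stand-in gradient

# Upper banded storage for solveh_banded: row 0 = superdiagonal, row 1 = main diagonal
ab = np.zeros((2, n))
ab[0, 1:] = off
ab[1, :] = diag

dx_banded = solveh_banded(ab, -g)        # ~n k^2 work instead of ~n^3 for a dense solve

# Check against the dense solve
H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
print(np.allclose(dx_banded, np.linalg.solve(H, -g)))  # True
```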